|
A ''focused crawler''〔Soumen Chakrabarti, (Focused Web Crawling), in the (Encyclopedia of Database Systems).〕 is a web crawler that collects Web pages satisfying some specific property by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages with large PageRank" or "crawl pages about baseball". An important page property pertains to topics, leading to ''topical crawlers''. For example, a topical crawler may be deployed to collect pages about solar power or swine flu while minimizing resources spent fetching pages on other topics. Crawl frontier management need not be the only device used by focused crawlers; they may also use a Web directory, a Web text index, backlinks, or any other Web artifact.

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page.〔(Improving the Performance of Focused Web Crawlers), Sotiris Batsakis, Euripides G. M. Petrakis, Evangelos Milios, 2012-04-09〕 A possible predictor is the anchor text of links; this was the approach taken by Pinkerton〔Pinkerton, B. (1994). (Finding what people want: Experiences with the WebCrawler). In Proceedings of the First World Wide Web Conference, Geneva, Switzerland.〕 in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer.〔Menczer, F. (1997). (ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery). In D. Fisher, ed., Proceedings of the 14th International Conference on Machine Learning (ICML97). Morgan Kaufmann.〕〔Menczer, F. and Belew, R.K. (1998). (Adaptive Information Agents in Distributed Textual Environments). In K. Sycara and M. Wooldridge (eds.)
Proceedings of the 2nd International Conference on Autonomous Agents (Agents '98). ACM Press.〕 Chakrabarti ''et al.'' coined the term ''focused crawler'' and used a text classifier〔(Focused crawling: a new approach to topic-specific Web resource discovery), Soumen Chakrabarti, Martin van den Berg and Byron Dom, WWW 1999.〕 to prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning〔(A machine learning approach to building domain-specific search engines), Andrew McCallum, Kamal Nigam, Jason Rennie, and Kristie Seymore, IJCAI 1999.〕〔(Using Reinforcement Learning to Spider the Web Efficiently), Jason Rennie and Andrew McCallum, ICML 1999.〕 to focus crawlers. Diligenti ''et al.'' traced the context graph〔Diligenti, M., Coetzee, F., Lawrence, S., Giles, C. L., and Gori, M. (2000). (Focused crawling using context graphs). In Proceedings of the 26th International Conference on Very Large Databases (VLDB), pages 527-534, Cairo, Egypt.〕 leading up to relevant pages, together with their text content, to train classifiers. A form of online reinforcement learning, along with features extracted from the DOM tree and the text of linking pages, has been used to continually train〔(Accelerated focused crawling through online relevance feedback), Soumen Chakrabarti, Kunal Punera, and Mallela Subramanyam, WWW 2002.〕 classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer ''et al.''〔Menczer, F., Pant, G., and Srinivasan, P. (2004). (Topical Web Crawlers: Evaluating Adaptive Algorithms). ACM Trans. on Internet Technology 4(4): 378–419.〕 show that simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls.
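
The idea of prioritizing the crawl frontier by predicted relevance can be sketched as a best-first crawl loop. The following Python sketch is only an illustration under stated assumptions: the keyword-overlap scorer `score_link` is a stand-in for the trained classifiers described above, the in-memory `TOY_WEB` dictionary replaces real page fetching, and every name in it is hypothetical rather than taken from any of the cited systems.

```python
import heapq
import itertools

# Hypothetical toy "Web": URL -> list of (anchor text, target URL) links.
TOY_WEB = {
    "seed": [("solar power basics", "a"), ("celebrity gossip", "b")],
    "a": [("solar panel efficiency", "c"), ("football scores", "d")],
    "b": [("movie reviews", "e")],
    "c": [],
    "d": [],
    "e": [],
}

# Assumed topic description: a small set of keywords.
TOPIC_TERMS = {"solar", "power", "panel"}


def score_link(anchor_text):
    """Fraction of anchor-text words that are topic terms (stand-in for a classifier)."""
    words = anchor_text.lower().split()
    if not words:
        return 0.0
    return sum(w in TOPIC_TERMS for w in words) / len(words)


def best_first_crawl(seed, budget):
    """Visit up to `budget` pages, always expanding the highest-scoring frontier link."""
    counter = itertools.count()  # tie-breaker: equal scores are served in insertion order
    frontier = [(-1.0, next(counter), seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for anchor, target in TOY_WEB.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score_link(anchor), next(counter), target))
    return visited
```

With a budget of three pages, `best_first_crawl("seed", 3)` visits `seed`, `a`, and `c`, skipping the off-topic page `b` that a breadth-first crawl would have fetched second. A real focused crawler would replace `score_link` with a learned relevance predictor and `TOY_WEB` with an HTTP fetcher and link extractor.
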
Another type of focused crawler is the semantic focused crawler, which uses domain ontologies to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization purposes.〔Dong, H., Hussain, F.K., Chang, E.: (State of the art in semantic focused crawlers). Computational Science and Its Applications – ICCSA 2009. Springer-Verlag, Seoul, Korea (July 2009) pp. 910-924〕 In addition, ontologies can be automatically updated during the crawling process. Dong ''et al.''〔Dong, H., Hussain, F.K.: (SOF: A semi-supervised ontology-learning-based focused crawler.) Concurrency and Computation: Practice and Experience. 25(12) (August 2013) pp. 1623-1812〕 introduced such an ontology-learning-based crawler, which uses a support vector machine to update the content of ontological concepts while crawling Web pages.

Crawlers can also focus on page properties other than topics. Cho ''et al.''〔Junghoo Cho, Hector Garcia-Molina, Lawrence Page: (Efficient Crawling Through URL Ordering). Computer Networks 30(1-7): 161-172 (1998)〕 study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener〔Marc Najork, Janet L. Wiener: (Breadth-first crawling yields high-quality pages). WWW 2001: 114-118〕 show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron ''et al.''〔Nadav Eiron, Kevin S. McCurley, John A. Tomlin: (Ranking the web frontier). WWW 2004: 309-318.〕

The performance of a focused crawler depends on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine to provide starting points. Davison〔Brian D. Davison: (Topical locality in the Web).
SIGIR 2000: 272-279.〕 presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti ''et al.''〔Soumen Chakrabarti, Mukul Joshi, Kunal Punera, David M. Pennock: (The structure of broad topics on the Web). WWW 2002: 251-262.〕

Seed selection can be important for focused crawlers and can significantly influence crawling efficiency.〔Jian Wu, Pradeep Teregowda, Juan Pablo Fernández Ramírez, Prasenjit Mitra, Shuyi Zheng, C. Lee Giles, (The evolution of a crawling strategy for an academic document search engine: whitelists and blacklists), In Proceedings of the 3rd Annual ACM Web Science Conference, pages 340-343, Evanston, IL, USA, June 2012.〕 A whitelist strategy starts the focused crawl from a list of high-quality seed URLs and limits the crawling scope to the domains of these URLs. These high-quality seeds should be selected from a list of URL candidates accumulated over a sufficiently long period of general web crawling. Once created, the whitelist should be updated periodically.
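
The whitelist strategy can be illustrated with a small scope filter that admits only URLs whose host belongs to one of the seed domains. This is a minimal sketch, not the implementation used by Wu ''et al.'': the seed URLs, the function names, and the decision to treat subdomains of a seed host as in scope are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical seed list; real seeds would be chosen from a prior general crawl.
SEED_URLS = [
    "https://www.example.edu/papers/",
    "http://cs.example.org/index.html",
]


def build_whitelist(seed_urls):
    """Collect the host names of the seed URLs (urlsplit lowercases them)."""
    hosts = (urlsplit(u).hostname for u in seed_urls)
    return {h for h in hosts if h}


def in_scope(url, whitelist):
    """Return True if the URL's host is a whitelisted host or a subdomain of one."""
    host = urlsplit(url).hostname
    if host is None:  # e.g. mailto: links or relative paths carry no host
        return False
    return any(host == d or host.endswith("." + d) for d in whitelist)
```

A crawler would call `in_scope` on each extracted link before adding it to the frontier, and rebuild the whitelist whenever the seed list is refreshed.
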
|